Abstract 

The path-difference metric is one of the oldest distances for the comparison of 
fully resolved phylogenetic trees, but its statistical properties are still 
largely unknown. In this paper we compute the mean value of the square of the 
path-difference metric between two fully resolved rooted phylogenetic trees 
with n leaves, under the uniform distribution. This complements previous 
work by Steel and Penny, who computed this mean value for fully resolved 
unrooted phylogenetic trees. 

1. Introduction 

The definition and study of metrics for the comparison of rooted phylogenetic 
trees is a classical problem in phylogenetics [9, Ch. 30], motivated by the 
need to compare alternative phylogenetic trees for a given set of organisms 
obtained from different datasets or using different reconstruction algorithms. 
Other applications of these metrics include the assessment of phylogenetic 
tree reconstruction methods [18] and the definition of search-by-similarity 
procedures on databases [12]. 
Many metrics for the comparison of rooted phylogenetic trees on the same 
set of taxa have been proposed so far. Some of the first such metrics, defined 
around 40 years ago, were based on the comparison of the vectors of lengths of 
(undirected) paths connecting pairs of taxa in the corresponding trees. These 
metrics comprise, for instance, the euclidean distance between these vectors 
[gI, 0], the Manhattan distance between them [18], or the correlation between 
them [14]. Similar metrics have also been defined for unrooted phylogenetic 
trees jJlS, 17 1. Let us point out here that, in the rooted case, these metrics 
satisfy the separation axiom of metrics (distance 0 means isomorphism) only 
for fully resolved, or binary, phylogenetic trees, and hence they are metrics, 
in the actual mathematical sense of the term, only in this case; cf. [5]. In 
the unrooted case, they are metrics for arbitrary trees. 

In contrast with other metrics jl, [l^, 16, 17 1, and despite their tradition 
and popularity, the statistical properties of these metrics based on path 
lengths are mostly unknown. For instance, the diameter is not yet known for 
any of these metrics, either in the rooted or in the unrooted case. Steel and 
Penny [17] studied, among others, the distribution of one of these distances 
for unrooted trees: the one defined through the euclidean distance between 
path-lengths vectors, which these authors called the path-difference metric 
(other published names for this metric are the cladistic difference [6] and, 
generically, a nodal distance [5], [11]). In the aforementioned paper, Steel and 
Penny computed the mean value of the square of this path-difference metric 
for fully resolved unrooted trees. The knowledge of this mean value is useful 
in the assessment of a comparison of two trees through this metric, because 
it "provides an indication as to whether or not this measured similarity could 
have come about by chance" [17]. 



In this paper we compute the mean value of the square of the path-difference 
metric for fully resolved rooted phylogenetic trees with $n$ leaves. Although 
the basic argument underlying our computation is the same as in Steel and 
Penny's paper, the details in the rooted case are much harder than in the 
unrooted case, because of the asymmetric role of the root. We have proved 
that this mean value grows in $O(n^3)$; more specifically, it is 

$$ 2\binom{n}{2}\left(4(n-1)+2-\frac{2^{2(n-1)}}{\binom{2(n-1)}{n-1}}-\left(\frac{2^{2(n-1)}}{\binom{2(n-1)}{n-1}}\right)^{2}\right). $$

This turns out to be the mean value obtained by Steel and Penny for unrooted 
phylogenetic trees, but with $n+1$ leaves. A similar relationship between 
combinatorial values for rooted and unrooted phylogenetic trees arises in other 
problems; for instance, a simple argument shows that the number of rooted 
phylogenetic trees with $n$ leaves is the number of unrooted phylogenetic trees 
with $n+1$ leaves [9, Ch. 3]; also, as we shall see in this paper (Corollary 11), 
the mean value of the length of the undirected path between two given leaves 
in a rooted phylogenetic tree with $n$ leaves is equal to the corresponding mean 
value for unrooted phylogenetic trees with $n+1$ leaves. But we have not been 
able to find a clever argument that proves directly this relationship between 
the mean values of the squared path-difference metric, or of the path-length 
between two leaves, in the rooted and unrooted cases, and thus we have needed 
to compute them. 

2. Preliminaries 

2.1. Phylogenetic trees 

In this paper, by a phylogenetic tree on a set S of taxa we mean a fully 
resolved, or binary (that is, with all its internal nodes of out-degree 2), rooted 
tree with its leaves bijectively labeled in the set S. To simplify the language, 
we shall always identify a leaf of a phylogenetic tree with its label. We shall 
also use the term phylogenetic tree with n leaves to refer to a phylogenetic 
tree on a given set of n taxa, when this set is known or irrelevant. 

We shall represent a path from $u$ to $v$ in a phylogenetic tree $T$ by 
$u \rightsquigarrow v$. Whenever there exists a path $u \rightsquigarrow v$, we 
shall say that $v$ is a descendant of $u$ and also that $u$ is an ancestor of 
$v$. Given a node $v$ of a phylogenetic tree $T$, the subtree of $T$ rooted at 
$v$ is the subgraph of $T$ induced on the set of descendants of $v$. It is a 
phylogenetic tree on the set of descendant leaves of $v$, and with root this 
node $v$. 

The lowest common ancestor (LCA) of a pair of nodes $u, v$ of a phylogenetic 
tree $T$, in symbols $LCA_T(u,v)$, is the unique common ancestor of them that 
is a descendant of every other common ancestor of them. The path difference 
$d_T(u,v)$ between two nodes $u$ and $v$ is the sum of the lengths of the paths 
$LCA_T(u,v) \rightsquigarrow u$ and $LCA_T(u,v) \rightsquigarrow v$; 
equivalently, it is the length of the only path connecting $u$ and $v$ in the 
undirected tree associated to $T$. It is well known (for a proof, see j^) that 
the vector of path differences $d(T) = (d_T(i,j))_{1 \leq i < j \leq n}$ between 
all pairs of leaves characterizes up to isomorphism a phylogenetic tree with 
$n$ leaves: this property is false if we remove the binarity assumption on the 
trees. 

Let $\mathcal{T}_n$ be the set of (isomorphism classes of) phylogenetic trees 
with $n$ leaves. It is well known [9, Ch. 3] that $|\mathcal{T}_1| = 1$ and, for 
every $n \geq 2$, 

$$ |\mathcal{T}_n| = (2n-3)!! = (2n-3)(2n-5)\cdots 3\cdot 1. $$
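The count $|\mathcal{T}_n| = (2n-3)!!$ can be checked by brute force for small $n$. The following Python sketch is only an illustration and is not part of the paper's argument: it assumes that trees are encoded as nested pairs of integer labels, and the helper names `double_factorial`, `insert_leaf` and `all_trees` are ours. It builds every tree on $\{1,\ldots,n\}$ by hanging the new leaf on each edge, or above the root, of every tree on $\{1,\ldots,n-1\}$.

```python
def double_factorial(k):
    """Return k!! for k >= -1, with the convention (-1)!! = 0!! = 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def insert_leaf(tree, leaf):
    """Yield each tree obtained by hanging `leaf` on an edge of `tree` or above its root."""
    yield (tree, leaf)  # new parent placed above the root of this (sub)tree
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insert_leaf(left, leaf):
            yield (new_left, right)
        for new_right in insert_leaf(right, leaf):
            yield (left, new_right)


def all_trees(n):
    """List all rooted binary phylogenetic trees on leaves 1..n as nested pairs."""
    trees = [1]
    for leaf in range(2, n + 1):
        trees = [t for tree in trees for t in insert_leaf(tree, leaf)]
    return trees


for n in range(1, 7):
    print(n, len(all_trees(n)), double_factorial(2 * n - 3))
```

Each tree with $n-1$ leaves has $2n-3$ insertion positions, which is exactly the recurrence behind the double factorial.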
An ordered $m$-forest on a set $S$ is an ordered sequence of $m$ phylogenetic 
trees $(T_1, T_2, \ldots, T_m)$, each $T_i$ on a set $S_i$ of taxa, such that 
these sets $S_i$ are pairwise disjoint and their union is $S$. Let 
$\mathcal{F}_{m,n}$ be the set of (isomorphism classes of) ordered $m$-forests 
on any given set $S$ with $|S| = n$. The cardinal of $\mathcal{F}_{m,n}$ is 
computed (although not explicitly) along the proof of Theorem 3 in 0: 

Lemma 1. For every $m \geq 1$, $|\mathcal{F}_{m,m}| = m!$ and 

$$ |\mathcal{F}_{m,n}| = \frac{m\,(2n-m-1)!}{(2(n-m))!!} = \frac{(2n-m-1)!\,m}{(n-m)!\,2^{n-m}} \quad\text{for every } n > m. $$

Proof. The exponential generating function for the number of rooted 
phylogenetic trees with $n$ leaves is $B(x) = 1 - \sqrt{1-2x}$. Then, the 
exponential generating function for the number of ordered forests consisting 
of a given number of trees (marked by the variable $y$) and a given global 
number of leaves (marked by the variable $x$) is 

$$ \frac{1}{1 - yB(x)} = \sum_{m \geq 0} y^m B(x)^m. $$

This implies that the number $|\mathcal{F}_{m,n}|$ of ordered $m$-forests on a 
set of $n$ leaves is equal to 
$\frac{d^n}{dx^n}\big(B(x)^m\big)\big|_{x=0}$. This derivative can be easily 
computed, yielding the values given in the statement. □ 
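The generating-function argument above can be replayed numerically. The sketch below is our own check, under the assumption that $|\mathcal{F}_{m,n}| = n!\,[x^n]B(x)^m$ as in the proof; the function names are hypothetical. It multiplies truncated power series with exact rational coefficients and compares the result with the closed formula of Lemma 1.

```python
from fractions import Fraction
from math import factorial


def double_factorial(k):
    """Return k!! for k >= -1, with the convention (-1)!! = 0!! = 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def forest_count_by_series(m, n):
    """Compute n! [x^n] B(x)^m, where B(x) = sum_{k>=1} (2k-3)!! x^k / k!."""
    tree_series = [Fraction(0)] + [
        Fraction(double_factorial(2 * k - 3), factorial(k)) for k in range(1, n + 1)
    ]
    power = [Fraction(1)] + [Fraction(0)] * n  # the constant series 1
    for _ in range(m):
        # Cauchy product of `power` with `tree_series`, truncated at degree n
        power = [
            sum(power[j] * tree_series[d - j] for j in range(d + 1))
            for d in range(n + 1)
        ]
    return int(power[n] * factorial(n))


def forest_count_closed_form(m, n):
    """Closed formula of Lemma 1: m (2n-m-1)! / (2(n-m))!!, valid for n >= m >= 1."""
    return m * factorial(2 * n - m - 1) // double_factorial(2 * (n - m))


for m in range(1, 4):
    print(m, [forest_count_by_series(m, n) for n in range(m, m + 4)])
```

For $m = 1$ the series count reduces to $|\mathcal{T}_n|$, and for $n = m$ both expressions give $m!$.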

2.2. Hypergeometric functions 

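The (generalized) hypergeometric function ${}_pF_q$ is defined as 

$$ {}_pF_q\left(\begin{matrix} a_1, \ldots, a_p \\ b_1, \ldots, b_q \end{matrix}; z\right) = \sum_{k \geq 0} \frac{(a_1)_k \cdots (a_p)_k}{(b_1)_k \cdots (b_q)_k}\cdot\frac{z^k}{k!}, $$

where $(a)_k := a \cdot (a+1) \cdots (a+k-1)$. 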

The following lemmas will be used in the next section. 

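Lemma 2. 

$$ {}_2F_1\left(\begin{matrix} n-1,\ 2-n \\ -n \end{matrix}; \frac{1}{2}\right) = \frac{2^{n-1}}{n}. $$

Proof. To compute the value of ${}_2F_1(n-1, 2-n; -n; 1/2)$ we shall use 
Formula 15.1.26 in [1] (see also http://functions.wolfram.com/07.23.03.0028.01): 

$$ {}_2F_1\left(\begin{matrix} a,\ 1-a \\ c \end{matrix}; \frac{1}{2}\right) = \frac{2^{1-c}\sqrt{\pi}\,\Gamma(c)}{\Gamma\left(\frac{a+c}{2}\right)\Gamma\left(\frac{1+c-a}{2}\right)}. $$

We cannot apply this expression to $a = n-1$ and $c = -n$, because 
$\Gamma(-n) = \infty$. So, instead, we use a standard passage to the limit: 

$$ \begin{aligned} {}_2F_1\left(\begin{matrix} n-1,\ 2-n \\ -n \end{matrix}; \frac{1}{2}\right) &= \lim_{\varepsilon \to 0} {}_2F_1\left(\begin{matrix} n-1,\ 2-n \\ -n+\varepsilon \end{matrix}; \frac{1}{2}\right) \\ &= \lim_{\varepsilon \to 0} \frac{2^{1+n-\varepsilon}\sqrt{\pi}\,\Gamma(-n+\varepsilon)}{\Gamma\left(\frac{\varepsilon-1}{2}\right)\Gamma\left(\frac{2-2n+\varepsilon}{2}\right)} = \frac{2^{n-1}}{n}. \end{aligned} $$

□ 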



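Lemma 3. 

$$ {}_3F_2\left(\begin{matrix} 1-n,\ 2-n,\ n-1 \\ -n,\ -n \end{matrix}; \frac{1}{2}\right) = \frac{2^{n-1}}{n^2}\left(-1 + \frac{(2n-1)!!}{2^{n-2}(n-1)!}\right). $$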



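Proof. The hypergeometric series ${}_3F_2(1-n, 2-n, n-1; -n, -n; 1/2)$ can be 
written as a function of the hypergeometric function ${}_2F_1$ as follows:¹ 

$$ {}_3F_2\left(\begin{matrix} 1-n,\ 2-n,\ n-1 \\ -n,\ -n \end{matrix}; \frac{1}{2}\right) = {}_2F_1\left(\begin{matrix} n-1,\ 2-n \\ -n \end{matrix}; \frac{1}{2}\right) - \frac{(n-1)(n-2)}{2n^2}\,{}_2F_1\left(\begin{matrix} n,\ 3-n \\ 1-n \end{matrix}; \frac{1}{2}\right). \qquad (1) $$

By Lemma 2, ${}_2F_1(n-1, 2-n; -n; 1/2) = 2^{n-1}/n$. It remains to compute 
${}_2F_1(n, 3-n; 1-n; 1/2)$. To do it, we shall use the following formula:² 

$$ {}_2F_1\left(\begin{matrix} a,\ 3-a \\ c \end{matrix}; \frac{1}{2}\right) = \frac{2^{3-c}\sqrt{\pi}\,\Gamma(c)}{(a-1)(a-2)}\left(\frac{c-2}{\Gamma\left(\frac{a+c-2}{2}\right)\Gamma\left(\frac{c-a+1}{2}\right)} - \frac{2}{\Gamma\left(\frac{a+c-3}{2}\right)\Gamma\left(\frac{c-a}{2}\right)}\right). $$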



¹ See http://functions.wolfram.com/07.27.03.0118.01 

² See http://functions.wolfram.com/07.23.03.0030.01 
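Again, we cannot apply this formula to $a = n$ and $c = 1-n$, and thus we 
use a passage to the limit: 

$$ \begin{aligned} {}_2F_1\left(\begin{matrix} n,\ 3-n \\ 1-n \end{matrix}; \frac{1}{2}\right) &= \lim_{\varepsilon \to 0} {}_2F_1\left(\begin{matrix} n,\ 3-n \\ 1-n+\varepsilon \end{matrix}; \frac{1}{2}\right) \\ &= \lim_{\varepsilon \to 0} \frac{2^{2+n-\varepsilon}\sqrt{\pi}\,\Gamma(1-n+\varepsilon)}{(n-1)(n-2)}\left(\frac{-n-1+\varepsilon}{\Gamma\left(\frac{\varepsilon-1}{2}\right)\Gamma\left(1-n+\frac{\varepsilon}{2}\right)} - \frac{2}{\Gamma\left(\frac{\varepsilon}{2}-1\right)\Gamma\left(\frac{1-2n+\varepsilon}{2}\right)}\right) \\ &= \frac{2^{2+n}\sqrt{\pi}}{(n-1)(n-2)}\left(\lim_{\varepsilon \to 0}\frac{(-n-1+\varepsilon)\,\Gamma(1-n+\varepsilon)}{\Gamma\left(\frac{\varepsilon-1}{2}\right)\Gamma\left(1-n+\frac{\varepsilon}{2}\right)} - \lim_{\varepsilon \to 0}\frac{2\,\Gamma(1-n+\varepsilon)}{\Gamma\left(\frac{\varepsilon}{2}-1\right)\Gamma\left(\frac{1-2n+\varepsilon}{2}\right)}\right) \\ &= \frac{2^{2+n}\sqrt{\pi}}{(n-1)(n-2)}\left(\frac{n+1}{4\sqrt{\pi}} - \frac{(2n-1)!!}{2^{n}(n-1)!\sqrt{\pi}}\right) \\ &= \frac{2^{2+n}}{(n-1)(n-2)}\left(\frac{n+1}{4} - \frac{(2n-1)!!}{(n-1)!\,2^{n}}\right). \end{aligned} $$

Replacing ${}_2F_1(n-1, 2-n; -n; 1/2)$ and ${}_2F_1(n, 3-n; 1-n; 1/2)$ in 
equation (1) by their values given above, we obtain 

$$ \begin{aligned} {}_3F_2\left(\begin{matrix} 1-n,\ 2-n,\ n-1 \\ -n,\ -n \end{matrix}; \frac{1}{2}\right) &= \frac{2^{n-1}}{n} - \frac{2^{n+1}}{n^2}\left(\frac{n+1}{4} - \frac{(2n-1)!!}{(n-1)!\,2^{n}}\right) \\ &= \frac{2^{n-1}}{n^2}\left(-1 + \frac{(2n-1)!!}{2^{n-2}(n-1)!}\right), \end{aligned} $$

as we claimed. □ 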
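Since both series in Lemmas 2 and 3 terminate, the two identities can be checked with exact rational arithmetic for small $n$. The sketch below is ours and only a sanity check: `hyp` is a hypothetical helper that sums a terminating ${}_pF_q$ series straight from the definition, and it assumes that some upper parameter is a non-positive integer (otherwise the loop would not stop).

```python
from fractions import Fraction
from math import factorial


def double_factorial(k):
    """Return k!! for k >= -1, with the convention (-1)!! = 0!! = 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def rising(a, k):
    """Pochhammer symbol (a)_k = a (a+1) ... (a+k-1)."""
    out = Fraction(1)
    for t in range(k):
        out *= a + t
    return out


def hyp(upper, lower, z):
    """Exact sum of a terminating pFq series (needs a non-positive integer upper parameter)."""
    total, k = Fraction(0), 0
    while True:
        numerator = Fraction(1)
        for a in upper:
            numerator *= rising(Fraction(a), k)
        if numerator == 0:  # every later term vanishes as well
            return total
        denominator = Fraction(factorial(k))
        for b in lower:
            denominator *= rising(Fraction(b), k)
        total += numerator / denominator * Fraction(z) ** k
        k += 1


def lemma2_rhs(n):
    return Fraction(2 ** (n - 1), n)


def lemma3_rhs(n):
    ratio = Fraction(double_factorial(2 * n - 1), 2 ** (n - 2) * factorial(n - 1))
    return Fraction(2 ** (n - 1), n * n) * (ratio - 1)


half = Fraction(1, 2)
for n in range(2, 8):
    print(n, hyp((n - 1, 2 - n), (-n,), half), hyp((1 - n, 2 - n, n - 1), (-n, -n), half))
```

The numerator vanishes at $k = n-1$, before any factor $(-n)_k$ in the denominator becomes zero, so no division by zero occurs.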
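Lemma 4. For all real numbers $a$, $b$, 

$$ {}_4F_3\left(\begin{matrix} 1,\ a,\ a+1/2,\ b \\ 2,\ 2a,\ b+1/2 \end{matrix}; 1\right) = \frac{2b-1}{(a-1)(b-1)}\left(-1 + {}_3F_2\left(\begin{matrix} a-1,\ a-1/2,\ b-1 \\ 2a-1,\ b-1/2 \end{matrix}; 1\right)\right). $$

Proof. By definition 

$$ {}_4F_3\left(\begin{matrix} 1,\ a,\ a+1/2,\ b \\ 2,\ 2a,\ b+1/2 \end{matrix}; 1\right) = \sum_{k \geq 0} \frac{k!\,(a)_k (a+1/2)_k (b)_k}{(k+1)!\,(2a)_k (b+1/2)_k}\cdot\frac{1}{k!} = \sum_{k \geq 1} \frac{(a)_{k-1}(a+1/2)_{k-1}(b)_{k-1}}{k!\,(2a)_{k-1}(b+1/2)_{k-1}}. \qquad (*) $$

Taking into account that 

$$ (x)_{k-1} = \frac{(x-1)_k}{x-1}, $$

the expression (*) can be written as 

$$ \begin{aligned} (*) &= \sum_{k \geq 1} \frac{(a-1)_k (a-1/2)_k (b-1)_k\,(2a-1)(b-1/2)}{(a-1)(a-1/2)(b-1)\,(2a-1)_k (b-1/2)_k}\cdot\frac{1}{k!} \\ &= \frac{2b-1}{(a-1)(b-1)}\left(-1 + {}_3F_2\left(\begin{matrix} a-1,\ a-1/2,\ b-1 \\ 2a-1,\ b-1/2 \end{matrix}; 1\right)\right), \end{aligned} $$

yielding the formula in the statement. □ 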

3. Mean total areas 

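For every $s \in \mathbb{Z}^+$, the total $s$-area of a phylogenetic tree $T$ is 

$$ D^{(s)}(T) = \sum_{1 \leq i < j \leq n} d_T(i,j)^s. $$

This value (or, rather, its $s$-th root) measures the total amount of 
evolutionary history captured by the phylogenetic tree. Let 

$$ \mu(D^{(s)})_n = \frac{\sum_{T \in \mathcal{T}_n} D^{(s)}(T)}{|\mathcal{T}_n|} $$

be the mean value of $D^{(s)}(T)$ for $T \in \mathcal{T}_n$ under the uniform 
distribution on $\mathcal{T}_n$. In this section we compute $\mu(D^{(1)})_n$ and 
$\mu(D^{(2)})_n$. To simplify the notations, for every $s \in \mathbb{Z}^+$ let 

$$ S_n^{(s)} = \sum_{T \in \mathcal{T}_n} d_T(1,2)^s. $$

Lemma 5. For every $s \in \mathbb{Z}^+$ and for every $1 \leq i < j \leq n$, 

$$ \sum_{T \in \mathcal{T}_n} d_T(i,j)^s = S_n^{(s)}. $$

Proof. Let $\sigma_{i,j}$ be the involutive permutation that interchanges 1 and 
$i$, and 2 and $j$, and leaves the other elements fixed and, for every 
$T \in \mathcal{T}_n$, let $T_{\sigma_{i,j}}$ be the phylogenetic tree obtained 
by applying to the leaves in $T$ the permutation $\sigma_{i,j}$. On the one 
hand, it is clear that $d_T(i,j) = d_{T_{\sigma_{i,j}}}(1,2)$, and, on the other 
hand, since the mapping $\mathcal{T}_n \to \mathcal{T}_n$ defined by 
$T \mapsto T_{\sigma_{i,j}}$ is bijective, we have the equality of multisets 

$$ \{d_T(1,2) \mid T \in \mathcal{T}_n\} = \{d_{T_{\sigma_{i,j}}}(1,2) \mid T \in \mathcal{T}_n\}. $$

Combining these two observations we obtain 

$$ \sum_{T \in \mathcal{T}_n} d_T(i,j)^s = \sum_{T \in \mathcal{T}_n} d_{T_{\sigma_{i,j}}}(1,2)^s = \sum_{T \in \mathcal{T}_n} d_T(1,2)^s = S_n^{(s)}. $$

□ 

Corollary 6. For every $n \geq 2$ and for every $s \in \mathbb{Z}^+$, 

$$ \mu(D^{(s)})_n = \binom{n}{2}\frac{S_n^{(s)}}{(2n-3)!!}. $$

Proof. Using the previous lemma, 

$$ \mu(D^{(s)})_n = \frac{\sum_{T \in \mathcal{T}_n}\sum_{1 \leq i < j \leq n} d_T(i,j)^s}{|\mathcal{T}_n|} = \frac{\sum_{1 \leq i < j \leq n} S_n^{(s)}}{(2n-3)!!} = \binom{n}{2}\frac{S_n^{(s)}}{(2n-3)!!}. $$

□ 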

For every $i = 2, \ldots, n$, let $c_i$ be the cardinal of the set 

$$ \{T \in \mathcal{T}_n \mid d_T(1,2) = i\}. $$

Then, $S_n^{(s)} = \sum_{i=2}^{n} i^s c_i$. We now look for a suitable expression 
for these coefficients $c_i$. 

Proposition 7. 

$$ c_i = \frac{(i-1)(2n-i-2)!}{(2(n-i))!!}. $$

Proof. Let $T \in \mathcal{T}_n$ be any tree such that $d_T(1,2) = i$; to 
simplify the notations, let us denote by $x$ the node $LCA_T(1,2)$. Then, on 
the one hand, the paths $x \rightsquigarrow 1$ and $x \rightsquigarrow 2$ have, 
respectively, $j$ and $i-2-j$ intermediate nodes, for some $j = 0, \ldots, i-2$, 
and each such intermediate node is the parent of the root of a rooted subtree 
of $T$. Let $\{i_1, \ldots, i_k\}$ be the union of the (pairwise disjoint) sets 
of leaves of these subtrees: notice that $i-2 \leq k \leq n-2$, because each 
subtree has some leaf and the leaves 1, 2 cannot belong to these subtrees. On 
the other hand, $x$ is the leaf of the phylogenetic tree $T_0$ with leaves 
$(\{1, \ldots, n\} \setminus \{1, 2, i_1, \ldots, i_k\}) \cup \{x\}$ obtained by 
collapsing the subtree of $T$ rooted at $x$ into a single leaf $x$. 

So, the tree $T$ is determined by a subset $\{i_1, \ldots, i_k\}$ of 
$\{3, \ldots, n\}$, with $i-2 \leq k \leq n-2$, a phylogenetic tree $T_0$ on 
$(\{1, \ldots, n\} \setminus \{1, 2, i_1, \ldots, i_k\}) \cup \{x\}$ (and hence 
with $n-k-1$ leaves), an ordered $(i-2)$-forest $(T_1, \ldots, T_{i-2})$ on 
$\{i_1, \ldots, i_k\}$, and an index $j \in \{0, 1, \ldots, i-2\}$. The tree $T$ 
is obtained by starting in the leaf $x$ of $T_0$ two new paths 
$(x, v_j, \ldots, v_1, 1)$ and $(x, v_{j+1}, \ldots, v_{i-2}, 2)$ of lengths 
$j+1$ and $i-j-1$, respectively, and then adding to each intermediate node 
$v_l$ in these paths an arc with head the root of the tree $T_l$ (cf. Fig. 1). 

Figure 1: The structure of a tree $T$ with $d_T(1,2) = i$. 
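This shows that $c_i$ can be computed as 

$$ \begin{aligned} c_i &= \sum_{k=i-2}^{n-2} (\text{number of ways of choosing the } k \text{ leaves } i_1, \ldots, i_k) \\ &\qquad\quad \cdot (\text{number of ordered } (i-2)\text{-forests on } \{i_1, \ldots, i_k\}) \\ &\qquad\quad \cdot (\text{number of ways of choosing } j \text{ between } 0 \text{ and } i-2) \\ &\qquad\quad \cdot (\text{number of phylogenetic trees with } n-k-1 \text{ leaves}) \\ &= \sum_{k=i-2}^{n-2} \binom{n-2}{k}\,|\mathcal{F}_{i-2,k}|\,(i-1)\,(2(n-k-1)-3)!! \\ &= (i-1)!\binom{n-2}{i-2}(2n-2i-1)!! + (i-1)(i-2)\sum_{k=0}^{n-i-1}\binom{n-2}{k+i-1}\frac{(2k+i-1)!}{(k+1)!\,2^{k+1}}(2n-2k-2i-3)!!. \end{aligned} $$

Applying the hypergeometric series lookup algorithm given in [13, p. 36], 
we obtain 

$$ \sum_{k=0}^{n-i-1}\binom{n-2}{k+i-1}\frac{(2k+i-1)!}{(k+1)!\,2^{k+1}}(2n-2k-2i-3)!! = \frac{(i-1)!}{2}\binom{n-2}{i-1}(2n-2i-3)!!\;{}_4F_3\left(\begin{matrix} 1,\ i/2,\ (i+1)/2,\ i-n+1 \\ 2,\ i,\ i-n+3/2 \end{matrix}; 1\right), $$

and hence 

$$ c_i = (i-1)!\binom{n-2}{i-2}(2n-2i-1)!!\left(1 + \frac{(i-2)(n-i)}{2(2n-2i-1)}\;{}_4F_3\left(\begin{matrix} 1,\ i/2,\ (i+1)/2,\ i-n+1 \\ 2,\ i,\ i-n+3/2 \end{matrix}; 1\right)\right). $$

If we apply Lemma 4 with $a = i/2$ and $b = i-n+1$, we obtain 

$$ c_i = (i-1)!\binom{n-2}{i-2}(2n-2i-1)!!\;{}_3F_2\left(\begin{matrix} i/2-1,\ i/2-1/2,\ i-n \\ i-1,\ i-n+1/2 \end{matrix}; 1\right). \qquad (2) $$

The value of ${}_3F_2(i/2-1, i/2-1/2, i-n; i-1, i-n+1/2; 1)$ is computed in 
0, Form. (2), p. 9]: 

$$ {}_3F_2\left(\begin{matrix} i/2-1,\ i/2-1/2,\ i-n \\ i-1,\ i-n+1/2 \end{matrix}; 1\right) = \frac{(i-2)!\,\Gamma(1/2+i-n)\,\Gamma(n-i/2)\,\Gamma(3/2-i/2)}{\Gamma(i/2)\,(n-2)!\,\Gamma(3/2+i/2-n)\,\Gamma(1/2)}. $$

Now, using that 

$$ \Gamma(x) = (x-1)\Gamma(x-1), $$

we can write 

$$ \begin{aligned} \Gamma(3/2-i/2) &= \frac{(-1)^{n-i}(2n-i-3)!!}{2^{n-i}(i-3)!!}\,\Gamma(3/2+i/2-n), \\ \Gamma(1/2) &= \frac{(-1)^{n-i}(2n-2i-1)!!}{2^{n-i}}\,\Gamma(1/2+i-n), \\ \Gamma(n-i/2) &= \frac{(2n-i-2)!!}{2^{n-i}(i-2)!!}\,\Gamma(i/2). \end{aligned} $$

Then, using these formulas, the expression for 
${}_3F_2(i/2-1, i/2-1/2, i-n; i-1, i-n+1/2; 1)$ can be simplified, yielding 

$$ {}_3F_2\left(\begin{matrix} i/2-1,\ i/2-1/2,\ i-n \\ i-1,\ i-n+1/2 \end{matrix}; 1\right) = \frac{(i-2)!\,(2n-i-2)!}{(i-2)!\,(n-2)!\,(2n-2i-1)!!\,2^{n-i}} = \frac{(2n-i-2)!}{(n-2)!\,(2n-2i-1)!!\,2^{n-i}}. $$

Replacing this expression in equation (2), we finally obtain 

$$ c_i = \frac{(i-1)(2n-i-2)!}{(n-i)!\,2^{n-i}} = \frac{(i-1)(2n-i-2)!}{(2(n-i))!!}, $$

as we claimed. □ 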
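Proposition 7 can be confirmed by exhaustive enumeration for small $n$. The sketch below is an illustrative check of ours, not the method of the proof: it assumes trees are stored as nested pairs of integer labels, measures $d_T(1,2)$ from root-to-leaf paths, and uses hypothetical helper names throughout.

```python
from collections import Counter
from math import factorial


def double_factorial(k):
    """Return k!! for k >= -1, with the convention (-1)!! = 0!! = 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def insert_leaf(tree, leaf):
    """Yield each tree obtained by hanging `leaf` on an edge of `tree` or above its root."""
    yield (tree, leaf)
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insert_leaf(left, leaf):
            yield (new_left, right)
        for new_right in insert_leaf(right, leaf):
            yield (left, new_right)


def all_trees(n):
    """List all rooted binary phylogenetic trees on leaves 1..n."""
    trees = [1]
    for leaf in range(2, n + 1):
        trees = [t for tree in trees for t in insert_leaf(tree, leaf)]
    return trees


def root_paths(tree, prefix=()):
    """Map each leaf to its sequence of left/right turns from the root."""
    if not isinstance(tree, tuple):
        return {tree: prefix}
    left, right = tree
    paths = root_paths(left, prefix + (0,))
    paths.update(root_paths(right, prefix + (1,)))
    return paths


def path_length(tree, a, b):
    """Length of the undirected path between leaves a and b."""
    paths = root_paths(tree)
    pa, pb = paths[a], paths[b]
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return len(pa) + len(pb) - 2 * common


def brute_force_counts(n):
    """Counter mapping i to the number of trees T with d_T(1,2) = i."""
    return Counter(path_length(t, 1, 2) for t in all_trees(n))


def closed_form(n, i):
    """c_i = (i-1)(2n-i-2)! / (2(n-i))!! as in Proposition 7."""
    return (i - 1) * factorial(2 * n - i - 2) // double_factorial(2 * (n - i))


for n in range(2, 7):
    counts = brute_force_counts(n)
    print(n, [(i, counts[i], closed_form(n, i)) for i in range(2, n + 1)])
```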

Proposition 8. $S_n^{(1)} = 2^{n-1}(n-1)! = (2n-2)!!$. 

Proof. By Proposition 7 we know that 

$$ S_n^{(1)} = \sum_{i=2}^{n} \frac{i(i-1)(2n-i-2)!}{(2(n-i))!!} = \sum_{i=0}^{n-2} \frac{(i+2)(i+1)(2n-i-4)!}{(2(n-i-2))!!}. $$

If we compute this sum using again the algorithm given in [13, p. 36], we 
obtain 

$$ S_n^{(1)} = \frac{2^{3-n}(2n-4)!}{(n-2)!}\;{}_2F_1\left(\begin{matrix} 3,\ 2-n \\ 4-2n \end{matrix}; 2\right). $$

Replacing in this equality ${}_2F_1(3, 2-n; 4-2n; 2)$ by its definition, we 
obtain 

$$ S_n^{(1)} = 2^{2-n}\sum_{k=0}^{n-2} \frac{(k+1)(k+2)(2n-k-4)!}{(n-k-2)!}\,2^k = \sum_{k=0}^{n-2} \frac{(n-k-1)(n-k)(n+k-2)!}{k!\,2^k}. $$

This sum can be computed using again the algorithm given in [13, p. 36], 
yielding 

$$ S_n^{(1)} = n!\;{}_2F_1\left(\begin{matrix} n-1,\ 2-n \\ -n \end{matrix}; \frac{1}{2}\right). $$

By Lemma 2 we conclude that 

$$ S_n^{(1)} = n!\cdot\frac{2^{n-1}}{n} = 2^{n-1}(n-1)! = (2n-2)!!, $$

as we claimed. □ 

Proposition 9. $S_n^{(2)} = 2\cdot(2n-1)!! - (2n-2)!!$. 

Proof. By Proposition 7, we have 

$$ S_n^{(2)} = \sum_{i=2}^{n} i^2 c_i = \sum_{i=2}^{n} \frac{i^2(i-1)(2n-i-2)!}{(2(n-i))!!} = \sum_{i=0}^{n-2} \frac{(i+2)^2(i+1)(2n-i-4)!}{(2(n-i-2))!!}. $$

Using again the algorithm given in [13, p. 36], the value of the sum 
$S_n^{(2)}$ is 

$$ \begin{aligned} S_n^{(2)} &= \frac{2^{4-n}(2n-4)!}{(n-2)!}\;{}_3F_2\left(\begin{matrix} 3,\ 3,\ 2-n \\ 2,\ 4-2n \end{matrix}; 2\right) = 2^{2-n}\sum_{k=0}^{n-2} \frac{(k+2)^2(k+1)(2n-k-4)!}{(n-k-2)!}\,2^k \\ &= \sum_{k=0}^{n-2} \frac{(n-k)^2(n-k-1)(n+k-2)!}{k!\,2^k}. \end{aligned} $$

Using once again the algorithm given in [13, p. 36], the last sum can be 
computed as: 

$$ S_n^{(2)} = n\cdot n!\;{}_3F_2\left(\begin{matrix} 1-n,\ 2-n,\ n-1 \\ -n,\ -n \end{matrix}; \frac{1}{2}\right). $$

By Lemma 3, we obtain 

$$ S_n^{(2)} = (n-1)!\,2^{n-1}\left(-1 + \frac{(2n-1)!!}{2^{n-2}(n-1)!}\right) = 2\cdot(2n-1)!! - (2n-2)!!, $$

as we claimed. □ 
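Propositions 8 and 9 can also be checked directly from the coefficients $c_i$ of Proposition 7, without any hypergeometric machinery. The following sketch is ours (the names `c` and `power_sum` are hypothetical); it evaluates $S_n^{(s)} = \sum_{i=2}^{n} i^s c_i$ with integer arithmetic and compares it with the two closed formulas.

```python
from math import factorial


def double_factorial(k):
    """Return k!! for k >= -1, with the convention (-1)!! = 0!! = 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def c(n, i):
    """Number of trees with n leaves and d_T(1,2) = i (Proposition 7)."""
    return (i - 1) * factorial(2 * n - i - 2) // double_factorial(2 * (n - i))


def power_sum(n, s):
    """S_n^(s) = sum over i = 2..n of i^s * c_i."""
    return sum(i ** s * c(n, i) for i in range(2, n + 1))


for n in range(2, 9):
    print(
        n,
        power_sum(n, 1), double_factorial(2 * n - 2),
        power_sum(n, 2), 2 * double_factorial(2 * n - 1) - double_factorial(2 * n - 2),
    )
```

Taking $s = 0$ in the same sum recovers $|\mathcal{T}_n| = (2n-3)!!$, which is a useful consistency check on the coefficients themselves.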

Applying Corollary 6 we obtain the following mean total areas. 

Corollary 10. 

$$ \mu(D^{(1)})_n = \binom{n}{2}\frac{(2n-2)!!}{(2n-3)!!}, \qquad \mu(D^{(2)})_n = \binom{n}{2}\frac{2(2n-1)!! - (2n-2)!!}{(2n-3)!!}. $$

$S_n^{(1)}$ can also be used to compute the mean value of the length of the 
undirected path between two given leaves in a phylogenetic tree. 

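Corollary 11. For every $i, j \in \{1, \ldots, n\}$, $i \neq j$, the mean value 
of $d_T(i,j)$ for $T \in \mathcal{T}_n$ under the uniform distribution is 

$$ \frac{2^{2(n-1)}}{\binom{2(n-1)}{n-1}}. $$

Proof. 

$$ \begin{aligned} \frac{\sum_{T \in \mathcal{T}_n} d_T(i,j)}{|\mathcal{T}_n|} &= \frac{S_n^{(1)}}{(2n-3)!!} = \frac{(2n-2)!!}{(2n-3)!!} \\ &= \frac{(2n-2)!!^2}{(2n-3)!!\,(2n-2)!!} = \frac{2^{2n-2}\,(n-1)!^2}{(2n-2)!} = \frac{2^{2(n-1)}}{\binom{2(n-1)}{n-1}}. \end{aligned} $$

□ 

In the unrooted case, this mean value is proved in [17] to be 
$2^{2(n-2)}/\binom{2(n-2)}{n-2}$. 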
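As a small worked example of Corollary 11 (ours, not taken from the paper), take $n = 3$. In nested-pair notation, the three trees in $\mathcal{T}_3$ are $((1,2),3)$, $((1,3),2)$ and $((2,3),1)$, for which $d_T(1,2)$ equals 2, 3 and 3, respectively. Hence

$$ \frac{1}{|\mathcal{T}_3|}\sum_{T \in \mathcal{T}_3} d_T(1,2) = \frac{2+3+3}{3} = \frac{8}{3} = \frac{16}{6} = \frac{2^{2(3-1)}}{\binom{2(3-1)}{3-1}}, $$

in agreement with the closed formula.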

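4. Mean path-difference distance 

The path-difference distance between a pair of phylogenetic trees 
$T, T' \in \mathcal{T}_n$ is 

$$ \delta(T, T') = \sqrt{\sum_{1 \leq i < j \leq n} \big(d_T(i,j) - d_{T'}(i,j)\big)^2}. $$

Lemma 12. The mean value of $\delta(T, T')^2$, with $T, T' \in \mathcal{T}_n$, 
under the uniform distribution on $\mathcal{T}_n$, is 

$$ \mu(\delta^2)_n = 2\binom{n}{2}\left(4(n-1)+2-\frac{2^{2(n-1)}}{\binom{2(n-1)}{n-1}}-\left(\frac{2^{2(n-1)}}{\binom{2(n-1)}{n-1}}\right)^{2}\right). $$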
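Proof. By definition 

$$ \mu(\delta^2)_n = \frac{1}{(2n-3)!!^2}\sum_{1 \leq i < j \leq n}\ \sum_{T, T' \in \mathcal{T}_n} \big(d_T(i,j)^2 + d_{T'}(i,j)^2 - 2\,d_T(i,j)\,d_{T'}(i,j)\big) $$

and then, using Lemma 5, 

$$ \mu(\delta^2)_n = 2\binom{n}{2}\left(\frac{S_n^{(2)}}{(2n-3)!!} - \left(\frac{S_n^{(1)}}{(2n-3)!!}\right)^{2}\right). $$

If we replace $S_n^{(1)}$ and $S_n^{(2)}$ by their values given in Propositions 
8 and 9, we obtain 

$$ \begin{aligned} \mu(\delta^2)_n &= 2\binom{n}{2}\left(\frac{2(2n-1)!! - (2n-2)!!}{(2n-3)!!} - \left(\frac{(2n-2)!!}{(2n-3)!!}\right)^{2}\right) \\ &= 2\binom{n}{2}\left(4n - 2 - \frac{(2n-2)!!}{(2n-3)!!} - \left(\frac{(2n-2)!!}{(2n-3)!!}\right)^{2}\right). \end{aligned} $$

Applying finally (see Corollary 11) 

$$ \frac{(2n-2)!!}{(2n-3)!!} = \frac{2^{2(n-1)}}{\binom{2(n-1)}{n-1}}, $$

we obtain the expression in the statement. □ 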
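Lemma 12 can be tested against a direct average over all ordered pairs of trees when $n$ is small. The sketch below is an illustration of ours under the same nested-pair encoding used for hand enumeration; the function names are hypothetical, and the exhaustive average is only feasible for very small $n$ because $|\mathcal{T}_n|^2$ grows quickly.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def insert_leaf(tree, leaf):
    """Yield each tree obtained by hanging `leaf` on an edge of `tree` or above its root."""
    yield (tree, leaf)
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insert_leaf(left, leaf):
            yield (new_left, right)
        for new_right in insert_leaf(right, leaf):
            yield (left, new_right)


def all_trees(n):
    """List all rooted binary phylogenetic trees on leaves 1..n."""
    trees = [1]
    for leaf in range(2, n + 1):
        trees = [t for tree in trees for t in insert_leaf(tree, leaf)]
    return trees


def root_paths(tree, prefix=()):
    """Map each leaf to its sequence of left/right turns from the root."""
    if not isinstance(tree, tuple):
        return {tree: prefix}
    left, right = tree
    paths = root_paths(left, prefix + (0,))
    paths.update(root_paths(right, prefix + (1,)))
    return paths


def distance_vector(tree, n):
    """Vector (d_T(i,j)) over all pairs 1 <= i < j <= n."""
    paths = root_paths(tree)
    vector = []
    for a, b in combinations(range(1, n + 1), 2):
        pa, pb = paths[a], paths[b]
        common = 0
        for x, y in zip(pa, pb):
            if x != y:
                break
            common += 1
        vector.append(len(pa) + len(pb) - 2 * common)
    return vector


def brute_force_mean(n):
    """Exact mean of delta(T, T')^2 over all ordered pairs of trees."""
    vectors = [distance_vector(t, n) for t in all_trees(n)]
    total = sum((x - y) ** 2 for u in vectors for v in vectors for x, y in zip(u, v))
    return Fraction(total, len(vectors) ** 2)


def closed_form_mean(n):
    """Right-hand side of Lemma 12."""
    r = Fraction(4 ** (n - 1), comb(2 * (n - 1), n - 1))
    return 2 * comb(n, 2) * (4 * (n - 1) + 2 - r - r * r)


for n in range(2, 6):
    print(n, brute_force_mean(n), closed_form_mean(n))
```

For $n = 3$ any two distinct trees are at squared distance 2, so the mean over the nine ordered pairs is $12/9 = 4/3$.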

The value of $\mu(\delta^2)_n$ obtained by Steel and Penny in the unrooted case 
was 

$$ \mu(\delta^2)_n = 2\binom{n}{2}\left(4(n-2)+2-\frac{2^{2(n-2)}}{\binom{2(n-2)}{n-2}}-\left(\frac{2^{2(n-2)}}{\binom{2(n-2)}{n-2}}\right)^{2}\right). $$

Using Stirling's approximation, both mean values are equivalent to 

$$ 2\binom{n}{2}\left((4-\pi)n - \sqrt{\pi n}\right). $$
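The quality of the Stirling estimate can be gauged numerically for the rooted case. The sketch below is ours and uses floating-point arithmetic, so it is only indicative: `exact_mean` evaluates the formula of Lemma 12 through the product $\prod_{k=1}^{n-1} 2k/(2k-1) = (2n-2)!!/(2n-3)!!$, and `stirling_estimate` evaluates $2\binom{n}{2}((4-\pi)n - \sqrt{\pi n})$. Both names are hypothetical.

```python
from math import pi, sqrt


def exact_mean(n):
    """Floating-point value of the rooted mean squared path-difference (Lemma 12)."""
    r = 1.0
    for k in range(1, n):
        r *= 2 * k / (2 * k - 1)  # builds (2n-2)!! / (2n-3)!!
    return n * (n - 1) * (4 * (n - 1) + 2 - r - r * r)


def stirling_estimate(n):
    """Asymptotic estimate n(n-1) * ((4 - pi) n - sqrt(pi n))."""
    return n * (n - 1) * ((4 - pi) * n - sqrt(pi * n))


for n in (10, 100, 1000, 10000):
    print(n, exact_mean(n) / stirling_estimate(n))
```

The printed ratios approach 1 as $n$ grows, which is what asymptotic equivalence means here.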